Abstract


We present a novel learning method for word embeddings designed for relation classification. Our word embeddings are trained by predicting words between noun pairs using lexical relation-specific features on a large unlabeled corpus. This allows us to explicitly incorporate relation-specific information into the word embeddings. The learned word embeddings are then used to construct feature vectors for a relation classification model. On a well-established semantic relation classification task, our method significantly outperforms a baseline based on a previously introduced word embedding method, and compares favorably to previous state-of-the-art models that use syntactic information or manually constructed external resources.

1 Introduction 


Automatic classification of semantic relations has a variety of applications, such as information extraction and the construction of semantic networks (Girju et al., 2007; Hendrickx et al., 2010). A traditional approach to relation classification is to train classifiers using various kinds of features with class labels annotated by humans. Carefully crafted features derived from lexical, syntactic, and semantic resources play a significant role in achieving high accuracy for semantic relation classification (Rink and Harabagiu, 2010).

In recent years there has been an increasing interest in using word embeddings as an alternative to traditional hand-crafted features. Word embeddings are represented as real-valued vectors and capture syntactic and semantic similarity between words. For example, word2vec¹ (Mikolov et al., 2013b) is a well-established tool for learning word embeddings. Although word2vec has successfully been used to learn word embeddings, these kinds of word embeddings capture only co-occurrence relationships between words (Levy and Goldberg, 2014). While simply adding word embeddings trained using window-based contexts as additional features to existing systems has proven valuable (Turian et al., 2010), more recent studies have focused on how to tune and enhance word embeddings for specific tasks (Bansal et al., 2014; Boros et al., 2014; Chen et al., 2014; Guo et al., 2014; Nguyen and Grishman, 2014), and we continue this line of research for the task of relation classification.


In this work we present a learning method for word embeddings specifically designed to be useful for relation classification. The overview of our system and the embedding learning process are shown in Figure 1. First we train word embeddings by predicting each of the words between noun pairs using lexical relation-specific features on a large unlabeled corpus. We then use the word embeddings to construct lexical feature vectors for relation classification. Lastly, the feature vectors are used to train a relation classification model.


We evaluate our method on a well-established semantic relation classification task and compare it to a baseline based on word2vec embeddings and previous state-of-the-art models that rely on either manually crafted features, syntactic parses or external semantic resources. Our method significantly outperforms the word2vec-based baseline, and compares favorably with previous state-of-the-art models, despite relying only on lexical level features and no external annotated resources. Furthermore, our qualitative analysis of the learned embeddings shows that n-grams of our embeddings capture salient syntactic patterns similar to semantic relation types.

1 https://code.google.com/p/word2vec/

Figure 1: The overview of our system (a) and the embedding learning method (b). In the example sentence, each of are, caused, and by is treated as a target word to be predicted during training.

2 Related Work 

A traditional approach to relation classification is to train classifiers in a supervised fashion using a variety of features. These features include lexical bag-of-words features and features based on syntactic parse trees. For syntactic parse trees, the paths between the target entities on constituency and dependency trees have been demonstrated to be useful (Bunescu and Mooney, 2005; Zhang et al., 2006). On the shared task introduced by Hendrickx et al. (2010), Rink and Harabagiu (2010) achieved the best score using a variety of hand-crafted features which were then used to train a Support Vector Machine (SVM).

Recently, word embeddings have become popular as an alternative to hand-crafted features (Collobert et al., 2011). However, one of the limitations is that word embeddings are usually learned by predicting a target word in its context, leading to only local co-occurrence information being captured (Levy and Goldberg, 2014). Thus, several recent studies have focused on overcoming this limitation. Le and Mikolov (2014) integrated paragraph information into a word2vec-based model, which allowed them to capture paragraph-level information. For dependency parsing, Bansal et al. (2014) and Chen et al. (2014) found ways to improve performance by integrating dependency-based context information into their embeddings. Bansal et al. (2014) trained embeddings by defining parent and child nodes in dependency trees as contexts. Chen et al. (2014) introduced the concept of feature embeddings induced by parsing a large unannotated corpus and then learning embeddings for the manually crafted features. For information extraction, Boros et al. (2014) trained word embeddings relevant for event role extraction, and Nguyen and Grishman (2014) employed word embeddings for domain adaptation of relation extraction. Another kind of task-specific word embeddings was proposed by Tang et al. (2014), who used sentiment labels on tweets to adapt word embeddings for a sentiment analysis task. However, such an approach is only feasible when a large amount of labeled data is available.

3 Relation Classification Using Word Embedding-based Features

We propose a novel method for learning word embeddings designed for relation classification. The word embeddings are trained by predicting each word between noun pairs, given the corresponding low-level features for relation classification. In general, to classify relations between pairs of nouns the most important features come from the pairs themselves and the words between and around the pairs (Hendrickx et al., 2010). For example, in the sentence in Figure 1 (b) there is a cause-effect relationship between the two nouns conflicts and players. To classify the relation, the most common features are the noun pair (conflicts, players), the words between the noun pair (are, caused, by), the words before the pair (the, external), and the words after the pair (playing, tiles, to, ...). As shown by Rink and Harabagiu (2010), the words between the noun pairs are the most effective among these features. Our main idea is to treat the most important features (the words between the noun pairs) as the targets to be predicted and other lexical features (noun pairs, words outside them) as their contexts. Due to this, we expect our embeddings to capture relevant features for relation classification better than previous models which only use window-based contexts.
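The split of a sentence into these four lexical feature groups can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `split_contexts`, the 0-based token indices, and the window size `m_out` are assumptions made for the example, and the token list stops at "to" because the example sentence is elided after that word in the text.

```python
def split_contexts(tokens, i1, i2, m_out):
    """Split a tokenized sentence around the noun pair at indices i1 < i2.

    Returns the noun pair, the words between the pair, and up to m_out
    words before and after the pair.
    """
    before = tokens[max(0, i1 - m_out):i1]
    between = tokens[i1 + 1:i2]
    after = tokens[i2 + 1:i2 + 1 + m_out]
    return (tokens[i1], tokens[i2]), between, before, after


# Example sentence from Figure 1 (b), truncated where the text elides it.
tokens = "the external conflicts are caused by players playing tiles to".split()
pair, between, before, after = split_contexts(tokens, 2, 6, m_out=5)
print(pair, between, before, after)
```

In the method described here, each word in `between` becomes a prediction target, while `pair`, `before`, `after`, and the neighbours of the target inside `between` act as its context.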

In this section we first describe the learning process for the word embeddings, focusing on lexical features for relation classification (Figure 1 (b)). We then propose a simple and powerful technique to construct features which serve as input for a softmax classifier. The overview of our proposed system is shown in Figure 1 (a).


3.1 Learning Word Embeddings 

Assume that there is a noun pair $n = (n_1, n_2)$ in a sentence with $M_{in}$ words between the pair and $M_{out}$ words before and after the pair:

• $w_{in} = (w^{in}_1, \ldots, w^{in}_{M_{in}})$,

• $w_{bef} = (w^{bef}_1, \ldots, w^{bef}_{M_{out}})$, and

• $w_{aft} = (w^{aft}_1, \ldots, w^{aft}_{M_{out}})$.

Our method predicts each target word $w^{in}_i \in w_{in}$ using three kinds of information: $n$, words around $w^{in}_i$ in $w_{in}$, and words in $w_{bef}$ and $w_{aft}$.

Words are embedded in a d-dimensional vector space and we refer to these vectors as word embeddings. To discriminate between words in $n$ from those in $w_{in}$, $w_{bef}$, and $w_{aft}$, we have two sets of word embeddings: $N \in \mathbb{R}^{d \times |\mathcal{N}|}$ and $W \in \mathbb{R}^{d \times |\mathcal{W}|}$. $\mathcal{W}$ is a set of words and $\mathcal{N}$ is also a set of words but contains only nouns. Hence, the word cause has two embeddings: one in $N$ and another in $W$. In general cause is used as a noun and a verb, and thus we expect the noun embeddings to capture the meanings focusing on their noun usage. This is inspired by some recent work on word representations that explicitly assigns an independent representation for each word usage according to its part-of-speech tag (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Hashimoto et al., 2013; Hashimoto et al., 2014; Kartsaklis and Sadrzadeh, 2013).

A feature vector $f \in \mathbb{R}^{2d(2+c) \times 1}$ is constructed to predict $w^{in}_i$ by concatenating word embeddings:


$$f = \Big[N(n_1); N(n_2); W(w^{in}_{i-1}); \ldots; W(w^{in}_{i-c}); W(w^{in}_{i+1}); \ldots; W(w^{in}_{i+c}); \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} W(w^{bef}_j); \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} W(w^{aft}_j)\Big]. \tag{1}$$

$N(\cdot)$ and $W(\cdot) \in \mathbb{R}^{d \times 1}$ correspond to each word and $c$ is the context size. A special NULL token is used if $i - j$ is smaller than 1 or $i + j$ is larger than $M_{in}$ for each $j \in \{1, 2, \ldots, c\}$.
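To make the layout of the feature vector in Eq. (1) concrete, the following Python sketch concatenates toy embeddings in that order and checks the resulting length against 2d(2+c). It is an illustration under stated assumptions, not the released implementation: embeddings are plain lists stored in dictionaries, the function name `build_feature_vector` is hypothetical, indices are 0-based, and the outside words are averaged over however many are available.

```python
import random


def build_feature_vector(N, W, pair, w_in, i, w_bef, w_aft, c, d):
    """Concatenate embeddings in the order of Eq. (1); i is a 0-based index into w_in."""

    def average(words):
        # Mean of the embeddings in W; an all-zero vector if there are no words
        # (an assumption, the text does not cover the empty case).
        if not words:
            return [0.0] * d
        return [sum(W[w][k] for w in words) / len(words) for k in range(d)]

    f = N[pair[0]] + N[pair[1]]                      # N(n1); N(n2)
    for j in range(1, c + 1):                        # left context, NULL if out of range
        f = f + (W[w_in[i - j]] if i - j >= 0 else W["NULL"])
    for j in range(1, c + 1):                        # right context, NULL if out of range
        f = f + (W[w_in[i + j]] if i + j < len(w_in) else W["NULL"])
    return f + average(w_bef) + average(w_aft)       # averaged outside words


d, c = 4, 2
rng = random.Random(0)
words = ["NULL", "the", "external", "are", "caused", "by", "playing", "tiles", "to"]
W = {w: [rng.gauss(0.0, 1.0) for _ in range(d)] for w in words}
N = {w: [rng.gauss(0.0, 1.0) for _ in range(d)] for w in ["conflicts", "players"]}

# Feature vector for predicting the target word "caused" (index 1 in w_in).
f = build_feature_vector(N, W, ("conflicts", "players"), ["are", "caused", "by"], 1,
                         ["the", "external"], ["playing", "tiles", "to"], c, d)
print(len(f), 2 * d * (2 + c))
```

The two noun embeddings contribute 2d entries, the 2c context words contribute 2cd, and the two averaged outside vectors contribute 2d, which together give the stated dimensionality 2d(2+c).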

Our method then estimates a conditional probability $p(w \mid f)$ that the target word is a word $w$ given the feature vector $f$, using a logistic regression model:

$$p(w \mid f) = \sigma(\tilde{W}(w) \cdot f + b(w)), \tag{2}$$

where $\tilde{W}(w) \in \mathbb{R}^{2d(2+c) \times 1}$ is a weight vector for $w$, $b(w) \in \mathbb{R}$ is a bias for $w$, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic function. Each column vector in $\tilde{W} \in \mathbb{R}^{2d(2+c) \times |\mathcal{W}|}$ corresponds to a word. That is, we assign a logistic regression model for each word, and we can train the embeddings using the one-versus-rest approach to make $p(w^{in}_i \mid f)$ larger than $p(w' \mid f)$ for $w' \neq w^{in}_i$. However, naively optimizing the parameters of those logistic regression models would lead to prohibitive computational cost since it grows linearly with the size of the vocabulary.

When training we employ several procedures introduced by Mikolov et al. (2013b), namely, negative sampling, a modified unigram noise distribution and subsampling. For negative sampling the model parameters $N$, $W$, $\tilde{W}$, and $b$ are learned by maximizing the objective function $J_{unlabeled}$:

$$J_{unlabeled} = \sum_{n} \sum_{i=1}^{M_{in}} \Big( \log p(w^{in}_i \mid f) + \sum_{j=1}^{k} \log\big(1 - p(w'_j \mid f)\big) \Big), \tag{3}$$

where $w'_j$ is a word randomly drawn from the unigram noise distribution weighted by an exponent of 0.75. Maximizing $J_{unlabeled}$ means that our method can discriminate between each target word and $k$ noise words given the target word's context. This approach is much less computationally expensive than the one-versus-rest approach and has proven effective in learning word embeddings.
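The contribution of a single training sample to the negative-sampling objective can be sketched as follows. This is a simplified illustration, not the authors' training code: the function names, the toy word counts, and the dictionary-based parameter storage are assumptions, and the sketch does not prevent the target word itself from being drawn as a noise word.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def noise_distribution(counts, power=0.75):
    """Unigram distribution with counts raised to the given exponent, then normalized."""
    weights = {w: n ** power for w, n in counts.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}


def sample_objective(f, target, noise_words, W_tilde, b):
    """log p(target | f) + sum_j log(1 - p(noise_j | f)) for one training sample."""
    score = math.log(sigmoid(dot(W_tilde[target], f) + b[target]))
    for w in noise_words:
        score += math.log(1.0 - sigmoid(dot(W_tilde[w], f) + b[w]))
    return score


counts = {"are": 100, "caused": 10, "by": 50}        # toy corpus counts (hypothetical)
noise = noise_distribution(counts)
rng = random.Random(0)
k = 5
noise_words = rng.choices(list(noise), weights=list(noise.values()), k=k)

dim = 8
f = [0.1] * dim
W_tilde = {w: [0.0] * dim for w in counts}           # zero-initialized weight vectors
b = {w: 0.0 for w in counts}                         # zero-initialized biases
value = sample_objective(f, "caused", noise_words, W_tilde, b)
print(value)
```

With zero-initialized weights and biases every probability is 0.5, so the value starts at (1 + k) log 0.5 and increases as training separates the target word from the noise words.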





















To reduce redundancy during training we use subsampling. A training sample, whose target word is $w$, is discarded with the probability $P_d(w) = 1 - \sqrt{\frac{t}{p(w)}}$, where $t$ is a threshold which is set to $10^{-5}$ and $p(w)$ is a probability corresponding to the frequency of $w$ in the training corpus. The more frequent a target word is, the more likely it is to be discarded. To further emphasize infrequent words, we apply the subsampling approach not only to target words, but also to noun pairs; concretely, by drawing two random numbers $r_1$ and $r_2$, a training sample whose noun pair is $(n_1, n_2)$ is discarded if $P_d(n_1)$ is larger than $r_1$ or $P_d(n_2)$ is larger than $r_2$.
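The subsampling rule can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the random numbers are assumed to be uniform on [0, 1), and clipping the discard probability at zero is an added convenience that does not change which samples are kept, since a negative value is never larger than such a random number.

```python
import math
import random


def discard_probability(p_w, t=1e-5):
    """P_d(w) = 1 - sqrt(t / p(w)), clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(t / p_w))


def keep_sample(p_target, p_n1, p_n2, rng, t=1e-5):
    """Apply subsampling to the target word and then to both nouns of the pair."""
    if discard_probability(p_target, t) > rng.random():
        return False
    r1, r2 = rng.random(), rng.random()
    return not (discard_probability(p_n1, t) > r1 or discard_probability(p_n2, t) > r2)


rng = random.Random(0)
print(discard_probability(1e-3))           # a frequent word is discarded about 90% of the time
print(keep_sample(1e-6, 1e-6, 1e-6, rng))  # rare target and rare nouns are always kept
```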

Since the feature vector $f$ is constructed as defined in Eq. (1) at each training step, $\tilde{W}(w)$ is updated based on information about what pair of nouns surrounds $w$, what word n-grams appear in a small window around $w$, and what words appear outside the noun pair. Hence, the weight vector $\tilde{W}(w)$ captures rich information regarding the target word $w$.

3.2 Constructing Feature Vectors 

Once the word embeddings are trained, we can use them for relation classification. Given a noun pair $n = (n_1, n_2)$ with its context words $w_{in}$, $w_{bef}$, and $w_{aft}$, we construct a feature vector to classify the relation between $n_1$ and $n_2$ by concatenating three kinds of feature vectors:

$g_n$ the word embeddings of the noun pair,

$g_{in}$ the averaged n-gram embeddings between the pair, and

$g_{out}$ the concatenation of the averaged word embeddings in $w_{bef}$ and $w_{aft}$.

The feature vector $g_n \in \mathbb{R}^{2d \times 1}$ is the concatenation of $N(n_1)$ and $N(n_2)$:

$$g_n = [N(n_1); N(n_2)]. \tag{4}$$

Words between the noun pair contribute to classifying the relation, and one of the most common ways to incorporate an arbitrary number of words is treating them as a bag of words. However, word order information is lost for bag-of-words features such as averaged word embeddings. To incorporate the word order information, we first define n-gram embeddings $h_i \in \mathbb{R}^{4d(1+c) \times 1}$ between the noun pair:

$$h_i = [W(w^{in}_{i-1}); \ldots; W(w^{in}_{i-c}); W(w^{in}_{i+1}); \ldots; W(w^{in}_{i+c}); \tilde{W}(w^{in}_i)]. \tag{5}$$

Note that $\tilde{W}$ can also be used and that the value used for n is (2c+1). As described in Section 3.1, $\tilde{W}$ captures meaningful information about each word and after the first embedding learning step we can treat the embeddings in $\tilde{W}$ as features for the words. Mnih and Kavukcuoglu (2013) have demonstrated that using embeddings like those in $\tilde{W}$ is useful in representing the words. We then compute the feature vector $g_{in}$ by averaging $h_i$:

$$g_{in} = \frac{1}{M_{in}} \sum_{i=1}^{M_{in}} h_i. \tag{6}$$

We use the averaging approach since $M_{in}$ depends on each instance. The feature vector $g_{in}$ allows us to represent word sequences of arbitrary lengths as fixed-length feature vectors using the simple operations: concatenation and averaging.
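The two operations can be sketched in Python as below. The layout of each n-gram embedding is an assumption inferred from the stated dimensionality 4d(1+c): the 2c context embeddings from W (d entries each) followed by the weight vector of the centre word from W̃ (2d(2+c) entries). The function names are hypothetical, indices are 0-based, and at least one word is assumed to lie between the nouns.

```python
def ngram_embedding(W, W_tilde, w_in, i, c):
    """n-gram embedding h_i for the word at 0-based position i of w_in."""
    h = []
    for j in range(1, c + 1):                        # left context, NULL if out of range
        h = h + (W[w_in[i - j]] if i - j >= 0 else W["NULL"])
    for j in range(1, c + 1):                        # right context, NULL if out of range
        h = h + (W[w_in[i + j]] if i + j < len(w_in) else W["NULL"])
    return h + W_tilde[w_in[i]]                      # weight vector of the centre word


def averaged_ngram_embedding(W, W_tilde, w_in, c):
    """g_in: element-wise mean of h_i over all words between the noun pair."""
    hs = [ngram_embedding(W, W_tilde, w_in, i, c) for i in range(len(w_in))]
    return [sum(col) / len(hs) for col in zip(*hs)]


d, c = 2, 1
words = ["NULL", "are", "caused", "by"]
# Toy parameters: every entry of a word's vector equals the word's index in `words`.
W = {w: [float(n)] * d for n, w in enumerate(words)}
W_tilde = {w: [float(n)] * (2 * d * (2 + c)) for n, w in enumerate(words)}

g_in = averaged_ngram_embedding(W, W_tilde, ["are", "caused", "by"], c)
print(len(g_in), 4 * d * (1 + c))
```

Because the mean is taken element-wise over all positions, the output has the same length regardless of how many words separate the nouns.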

The words before and after the noun pair are sometimes important in classifying the relation. For example, in the phrase "pour $n_1$ into $n_2$", the word pour should be helpful in classifying the relation. As with Eq. (1), we use the concatenation of the averaged word embeddings of words before and after the noun pair to compute the feature vector $g_{out} \in \mathbb{R}^{2d \times 1}$:

$$g_{out} = \frac{1}{M_{out}} \Big[\sum_{j=1}^{M_{out}} W(w^{bef}_j); \sum_{j=1}^{M_{out}} W(w^{aft}_j)\Big]. \tag{7}$$

As described above, the overall feature vector $e \in \mathbb{R}^{4d(2+c) \times 1}$ is constructed by concatenating $g_n$, $g_{in}$, and $g_{out}$. We would like to emphasize that we only use simple operations: averaging and concatenating the learned word embeddings. The feature vector $e$ is then used as input for a softmax classifier, without any complex transformation such as matrix multiplication with non-linear functions.
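As a quick arithmetic check of the stated dimensionalities, the short Python sketch below adds up the sizes of the three parts of the feature vector. The function name is hypothetical; the values d = 100 and c = 3 are the best setting reported later in the paper.

```python
def feature_dimensions(d, c):
    """Sizes of g_n, g_in, g_out and of their concatenation e."""
    g_n = 2 * d                 # two noun embeddings
    g_in = 4 * d * (1 + c)      # averaged n-gram embedding
    g_out = 2 * d               # averaged embeddings before and after the pair
    return g_n, g_in, g_out, g_n + g_in + g_out


g_n, g_in, g_out, e = feature_dimensions(d=100, c=3)
print(g_n, g_in, g_out, e)
```

The total 2d + 4d(1+c) + 2d simplifies to 4d(2+c), which is the size of e given in the text.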

3.3 Supervised Learning 

Given a relation classification task we train a softmax classifier using the feature vector $e$ described in Section 3.2. For each $k$-th training sample with a corresponding label $l_k$ among $L$ predefined labels, we compute a conditional probability given its feature vector $e_k$:

$$p(l_k \mid e_k) = \frac{\exp(o(l_k))}{\sum_{i=1}^{L} \exp(o(i))}, \tag{8}$$

where $o \in \mathbb{R}^{L \times 1}$ is defined as $o = S e_k + s$, and $S \in \mathbb{R}^{L \times 4d(2+c)}$ and $s \in \mathbb{R}^{L \times 1}$ are the softmax parameters. $o(i)$ is the $i$-th element of $o$. We then define the objective function as:

$$J_{labeled} = \sum_{k=1}^{K} \log p(l_k \mid e_k) - \frac{\lambda}{2} \|\theta\|^2. \tag{9}$$

$K$ is the number of training samples and $\lambda$ controls the L2 regularization. $\theta = (N, W, \tilde{W}, S, s)$ is the set of parameters and $J_{labeled}$ is maximized using AdaGrad (Duchi et al., 2011). We have found that dropout (Hinton et al., 2012) is helpful in preventing our model from overfitting. Concretely, elements in $e$ are randomly omitted with a probability of 0.5 at each training step. Recently dropout has been applied to deep neural network models for natural language processing tasks and proven effective (Irsoy and Cardie, 2014; Paulus et al., 2014).

In what follows, we refer to the above method as RelEmb. While RelEmb uses only low-level features, a variety of useful features have been proposed for relation classification. Among them, we use dependency path features (Bunescu and Mooney, 2005) based on the untyped binary dependencies of the Stanford parser to find the shortest path between target nouns. The dependency path features are computed by averaging word embeddings from $W$ on the shortest path, and are then concatenated to the feature vector $e$. Furthermore, we directly incorporate semantic information using word-level semantic features from Named Entity (NE) tags and WordNet hypernyms, as used in previous work (Rink and Harabagiu, 2010; Socher et al., 2012; Yu et al., 2014). We refer to this extended method as RelEmb_FULL. Concretely, RelEmb_FULL uses the same binary features as in Socher et al. (2012). The features come from NE tags and WordNet hypernym tags of target nouns provided by a sense tagger (Ciaramita and Altun, 2006).

4 Experimental Settings


4.1 Training Data 


For pre-training we used a snapshot of the English Wikipedia² from November 2013. First, we extracted 80 million sentences from the original Wikipedia file, and then used Enju³ (Miyao and Tsujii, 2008) to automatically assign part-of-speech (POS) tags. From the POS tags we used NN, NNS, NNP, or NNPS to locate noun pairs in the corpus. We then collected training data by listing pairs of nouns and the words between, before, and after the noun pairs. A noun pair was omitted if the number of words between the pair was larger than 10 and we consequently collected 1.4 billion pairs of nouns and their contexts⁴. We used the 300,000 most frequent words and the 300,000 most frequent nouns and treated out-of-vocabulary words as a special UNK token.
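The collection step can be sketched as a generator over POS-tagged sentences. This is a hedged illustration and not the released preprocessing code: the function name `collect_noun_pairs` is hypothetical, the POS tags in the example are assigned by hand, and the sketch assumes that every ordered pair of nouns in a sentence is used, including adjacent nouns and pairs with other nouns in between, which the text does not specify.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def collect_noun_pairs(tagged, max_between=10, m_out=5):
    """Yield (n1, n2, between, before, after) for noun pairs in a POS-tagged sentence."""
    words = [w for w, _ in tagged]
    noun_positions = [i for i, (_, tag) in enumerate(tagged) if tag in NOUN_TAGS]
    for a, i1 in enumerate(noun_positions):
        for i2 in noun_positions[a + 1:]:
            if i2 - i1 - 1 > max_between:
                break  # later nouns are even further away
            yield (words[i1], words[i2], words[i1 + 1:i2],
                   words[max(0, i1 - m_out):i1], words[i2 + 1:i2 + 1 + m_out])


tagged = [("the", "DT"), ("external", "JJ"), ("conflicts", "NNS"), ("are", "VBP"),
          ("caused", "VBN"), ("by", "IN"), ("players", "NNS"), ("playing", "VBG"),
          ("tiles", "NNS"), ("to", "TO")]
pairs = list(collect_noun_pairs(tagged))
for p in pairs:
    print(p)
```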


4.2 Initialization and Optimization 

We initialized the embedding matrices N and W 
with zero-mean gaussian noise with a variance of 

$\tilde{W}$ and $b$ were zero-initialized. The model parameters were optimized by maximizing the objective function in Eq. (3) using stochastic gradient ascent. The learning rate was set to $\alpha$ and linearly decreased to 0 during training, as described in Mikolov et al. (2013a). The hyperparameters are the embedding dimensionality $d$, the context size $c$, the number of negative samples $k$, the initial learning rate $\alpha$, and $M_{out}$, the number of words outside the noun pairs. For hyperparameter tuning, we first fixed $\alpha$ to 0.025 and $M_{out}$ to 5, and then set $d$ to {50, 100, 300}, $c$ to {1, 2, 3}, and $k$ to {5, 15, 25}.
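A linearly decaying learning rate of this kind can be sketched as follows. The function name and the step-based parameterization are assumptions made for illustration; the text only states that the rate starts at the initial value and decreases linearly to 0 over training.

```python
def learning_rate(step, total_steps, alpha=0.025):
    """Learning rate after `step` of `total_steps` updates, decaying linearly to 0."""
    return alpha * (1.0 - step / total_steps)


for step in (0, 250, 500, 750, 1000):
    print(step, learning_rate(step, total_steps=1000))
```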

At the supervised learning step, we initialized $S$ and $s$ with zeros. The hyperparameters, the learning rate for AdaGrad, $\lambda$, $M_{out}$, and the number of iterations, were determined via 10-fold cross validation on the training set for each setting. Note that $M_{out}$ can be tuned at the supervised learning step, adapting to a specific dataset.


2 http://dumps.wikimedia.org/enwiki/

3 Despite Enju being a syntactic parser we only use the 
POS tagger component. The accuracy of the POS tagger is 
about 97.2% on the WSJ corpus. 

4 The training data, the training code, and the learned 
model parameters used in this paper are publicly available at 
http://www.logos.t.u-tokyo.ac.jp/~hassy/publications/c 
































5 Evaluation 

5.1 Evaluation Dataset 


We evaluated our method on the SemEval 2010 Task 8 dataset⁵ (Hendrickx et al., 2010), which involves predicting the semantic relations between noun pairs in their contexts. The dataset, containing 8,000 training and 2,717 test samples, defines nine classes (Cause-Effect, Entity-Origin, etc.) for ordered relations and one class (Other) for other relations. Thus, the task can be treated as a 19-class classification task. Two examples from the training set are shown below.

(a) Financial [stress]_E1 is one of the main causes of [divorce]_E2

(b) The [burst]_E1 has been caused by water hammer [pressure]_E2

Training example (a) is classified as Cause-Effect(E1, E2) which denotes that E2 is an effect caused by E1, while training example (b) is classified as Cause-Effect(E2, E1) which is the inverse of Cause-Effect(E1, E2). We report the official macro-averaged F1 scores and accuracy.

5.2 Models

To empirically investigate the performance of our proposed method we compared it to several baselines and previously proposed models.

5.2.1 Random and word2vec Initialization

Rand-Init. The first baseline is RelEmb itself, but without applying the learning method on the unlabeled corpus. In other words, we train the softmax classifier from Section 3.3 on the labeled training data with randomly initialized model parameters.

W2V-Init. The second baseline is RelEmb using word embeddings learned by word2vec. More specifically, we initialize the embedding matrices $N$ and $W$ with the word2vec embeddings. Related to our method, word2vec has a set of weight vectors similar to $\tilde{W}$ when trained with negative sampling and we use these weight vectors as a replacement for $\tilde{W}$. We trained the word2vec embeddings using the CBOW model with subsampling on the full Wikipedia corpus. As with our experimental settings, we fix the learning rate to 0.025, and investigate several hyperparameter settings. For hyperparameter tuning we set the embedding dimensionality $d$ to {50, 100, 300}, the context size $c$ to {1, 3, 9}, and the number of negative samples $k$ to {5, 15, 25}.

5.2.2 SVM-Based Systems

A simple approach to the relation classification task is to use SVMs with standard binary bag-of-words features. The bag-of-words features included the noun pairs and words between, before, and after the pairs, and we used LIBLINEAR⁶ as our classifier.

5.2.3 Neural Network Models 


Socher et al. (2012) used Recursive Neural Network (RNN) models to classify the relations. Subsequently, Ebrahimi and Dou (2015) and Hashimoto et al. (2013) proposed RNN models to better handle the relations. These methods rely on syntactic parse trees.

Yu et al. (2014) introduced their novel Factor-based Compositional Model (FCM) and presented results from several model variants, the best performing being FCM_EMB and FCM_FULL. The former only uses word embedding information and the latter relies on dependency paths and NE features, in addition to word embeddings.

Zeng et al. (2014) used a Convolutional Neural Network (CNN) with WordNet hypernyms. Noteworthy in relation to the RNN-based methods, the CNN model does not rely on parse trees. More recently, dos Santos et al. (2015) have introduced CR-CNN by extending the CNN model and achieved the best result to date. The key point of CR-CNN is that it improves the classification score by omitting the noisy class "Other" in the dataset described in Section 5.1. We call CR-CNN using the "Other" class CR-CNN_Other and CR-CNN omitting the class CR-CNN_Best.

5.3 Results and Discussion 

5 http://docs.google.com/View?docid=dfvxd4...

6 http://www.csie.ntu.edu.tw/~cjlin/liblinear/

7 While we use a POS tagger to locate noun pairs, RelEmb does not explicitly use POS features at the supervised learning step.

Model | Features for classifiers | F1 / ACC (%)
RelEmb_FULL | embeddings, dependency paths, WordNet, NE | 83.5 / 79.9
RelEmb | embeddings | 82.8 / 78.9
RelEmb (W2V-Init) | embeddings | 81.8 / 77.7
RelEmb (Rand-Init) | embeddings | 78.2 / 73.5
SVM | bag of words | 76.5 / 72.0
SVM (Rink and Harabagiu, 2010) | bag of words, POS, dependency paths, WordNet, paraphrases, TextRunner, Google n-grams, etc. | 82.2 / 77.9
CR-CNN_Best (dos Santos et al., 2015) | embeddings, word position embeddings | 84.1 / n/a
FCM_FULL (Yu et al., 2014) | embeddings, dependency paths, NE | 83.0 / n/a
CR-CNN_Other (dos Santos et al., 2015) | embeddings, word position embeddings | 82.7 / n/a
CRNN (Ebrahimi and Dou, 2015) | embeddings, parse trees, WordNet, NE, POS | 82.7 / n/a
CNN (Zeng et al., 2014) | embeddings, WordNet | 82.7 / n/a
MVRNN (Socher et al., 2012) | embeddings, parse trees, WordNet, NE, POS | 82.4 / n/a
FCM_EMB (Yu et al., 2014) | embeddings | 80.6 / n/a
RNN (Hashimoto et al., 2013) | embeddings, parse trees, phrase categories, etc. | 79.4 / n/a

Table 1: Scores on the test set for SemEval 2010 Task 8.

The scores on the test set for SemEval 2010 Task 8 are shown in Table 1. RelEmb achieves an F1 score of 82.8%, which is better than those of almost all models compared and comparable to that of the previous state of the art, except for CR-CNN_Best. Note that RelEmb does not rely on external semantic features and syntactic parse features⁷. Furthermore, RelEmb_FULL achieves an F1 score of 83.5%. We calculated a confidence interval (82.0, 84.9) (p < 0.05) using bootstrap resampling (Noreen, 1989).
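A confidence interval of this kind can be obtained with a percentile bootstrap over the test samples. The sketch below is one common variant and is not claimed to be the exact procedure used in the paper: the function names, the number of resamples, and the use of accuracy instead of the official macro-averaged F1 are assumptions made to keep the example short.

```python
import random


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_interval(gold, pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric`, resampling test items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


gold = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1] * 20
pred = [0, 1, 2, 1, 0, 2, 0, 0, 2, 2] * 20   # toy predictions with some errors
print(accuracy(gold, pred), bootstrap_interval(gold, pred, accuracy))
```

Any metric that takes gold and predicted labels can be passed in place of `accuracy`, so the same routine would apply to a macro-averaged F1 function.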


5.3.1 Comparison with the Baselines 

RelEmb significantly outperforms not only the Rand-Init baseline, but also the W2V-Init baseline. These results show that our task-specific word embeddings are more useful than those trained using window-based contexts. A point that we would like to emphasize is that the baselines are unexpectedly strong. As was noted by Wang and Manning (2012), we should carefully implement strong baselines and see whether complex models can outperform these baselines.


5.3.2 Comparison with SVM-Based Systems 

RelEmb performs much better than the bag-of-words-based SVM. This is not surprising given that we use a large unannotated corpus and embeddings with a large number of parameters. RelEmb also outperforms the SVM system of Rink and Harabagiu (2010), which demonstrates the effectiveness of our task-specific word embeddings, despite our only requirement being a large unannotated corpus and a POS tagger.


5.3.3 Comparison with Neural Network 
Models 

RelEmb outperforms the RNN models. In our preliminary experiments, we have found some undesirable parse trees when computing vector representations using RNN-based models and such parsing errors might hamper the performance of the RNN models.


FCM_FULL, which relies on dependency paths and NE features, achieves a better score than that of RelEmb. Without such features, RelEmb outperforms FCM_EMB by a large margin. By incorporating external resources, RelEmb_FULL outperforms FCM_FULL.


RelEmb compares favorably to CR-CNN_Other, despite our method being less computationally expensive than CR-CNN_Other. When classifying an instance, the number of floating-point multiplications is $4d(2 + c)L$ in our method since our method requires only one matrix-vector product for the softmax classifier as described in Section 3.3. $c$ is the window size, $d$ is the word embedding dimensionality, and $L$ is the number of the classes. In CR-CNN_Other, the number is $(Dc(d + 2d')N + DL)$, where $D$ is the dimensionality of the convolution layer, $d'$ is the position embedding dimensionality, and $N$ is the average length of the input sentences. Here, we omit the cost of the hyperbolic tangent function in CR-CNN_Other for simplicity. Using the best hyperparameter settings, the number is roughly $3.8 \times 10^4$ in our method, and $1.6 \times 10^7$ in CR-CNN_Other assuming $N$ is 10. dos Santos et al. (2015) also boosted the score of CR-CNN_Other by omitting the noisy class "Other" by a ranking-based classifier, and achieved the best score (CR-CNN_Best). Our results may also be improved by using the same technique, but the technique is dataset-dependent, so we did not incorporate the technique.
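The two multiplication counts can be reproduced with a few lines of Python. The RelEmb values (d = 100, c = 3, L = 19) come from the best setting and the class count given in this paper. The CR-CNN values (D = 1000, c = 3, d = 400, d' = 70) are not stated in this text; they are assumed here from the hyperparameters reported by dos Santos et al. (2015) and should be checked against that paper. The function names are hypothetical.

```python
def relemb_multiplications(d, c, num_classes):
    """One matrix-vector product between S (L x 4d(2+c)) and the feature vector e."""
    return 4 * d * (2 + c) * num_classes


def crcnn_multiplications(D, c, d, d_pos, N, num_classes):
    """Convolution over N positions plus the class scoring layer, without tanh."""
    return D * c * (d + 2 * d_pos) * N + D * num_classes


ours = relemb_multiplications(d=100, c=3, num_classes=19)
# Assumed CR-CNN hyperparameters; N = 10 as in the text.
theirs = crcnn_multiplications(D=1000, c=3, d=400, d_pos=70, N=10, num_classes=19)
print(ours, theirs, theirs / ours)
```

Under these assumptions the counts come out at 38,000 and about 16.2 million, in line with the rough figures quoted above.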
c | d | k = 5 | k = 15 | k = 25
1 | 50 | 80.5 | 81.0 | 80.9
1 | 100 | 80.9 | 81.3 | 81.2
2 | 50 | 80.9 | 81.3 | 81.3
2 | 100 | 81.3 | 81.6 | 81.7
3 | 50 | 81.0 | 81.0 | 81.5
3 | 100 | 81.3 | 81.9 | 82.2
3 | 300 | - | - | 82.0

Table 2: Cross-validation results for RelEmb.
c | d | k = 5 | k = 15 | k = 25
1 | 50 | 80.5 | 80.7 | 80.9
1 | 100 | 81.1 | 81.2 | 81.0
1 | 300 | 81.2 | 81.3 | 81.2
3 | 50 | 80.4 | 80.7 | 80.8
3 | 100 | 81.0 | 81.0 | 80.9
9 | 50 | 80.0 | 79.8 | 80.2
9 | 100 | 80.3 | 80.4 | 80.1

Table 3: Cross-validation results for the W2V-Init.


5.4 Analysis on Training Settings 

We perform analysis of the training procedure focusing on RelEmb.

5.4.1 Effects of Tuning Hyperparameters 

In Tables 2 and 3 we show how tuning the hyperparameters of our method and word2vec affects the classification results using 10-fold cross validation on the training set. The same split is used for each setting, so all results are comparable to each other. The best settings for the cross validation are used to produce the results reported in Table 1.

Table 2 shows F1 scores obtained by RelEmb. The results for d = 50, 100 show that RelEmb benefits from relatively large context sizes. The n-gram embeddings in RelEmb capture richer information by setting c to 3 compared to setting c to 1. Relatively large numbers of negative samples also slightly boost the scores. As opposed to these trends, the score does not improve using d = 300. We use the best setting (c = 3, d = 100, k = 25) for the remaining analysis. We note that RelEmb_FULL achieves an F1 score of 82.5.

We also performed similar experiments for the W2V-Init baseline, and the results are shown in Table 3. In this case, the number of negative samples does not affect the scores, and the best score is achieved by c = 1. As discussed in Bansal et al. (2014), the small context size captures the syntactic similarity between words rather than the topical similarity. This result indicates that syntactic similarity is more important than topical similarity for this task. Compared to the word2vec embeddings, our embeddings capture not only local context information using word order, but also long-range co-occurrence information by being tailored for the specific task.

g_n | g_in | g'_in | g_n, g_in | g_n, g_in, g_out
61.8 | 70.2 | 68.2 | 81.1 | 82.2

Table 4: Cross-validation results for ablation tests.

Method | Score
RelEmb N | 0.690
RelEmb W | 0.599
W2V-Init | 0.687

Table 5: Evaluation on the WordSim-353 dataset.

5.4.2 Ablation Tests 

As described in Section 3.2, we concatenate three kinds of feature vectors, $g_n$, $g_{in}$, and $g_{out}$, for supervised learning. Table 4 shows classification scores for ablation tests using 10-fold cross validation. We also provide a score using a simplified version of $g_{in}$, where the feature vector $g'_{in}$ is computed by averaging the word embeddings $[W(w^{in}_i); \tilde{W}(w^{in}_i)]$ of the words between the noun pairs. This feature vector $g'_{in}$ then serves as a bag-of-words feature.

Table 4 clearly shows that the averaged n-gram embeddings contribute the most to the semantic relation classification performance. The difference between the scores of $g_{in}$ and $g'_{in}$ shows the effectiveness of our averaged n-gram embeddings.

5.4.3 Effects of Dropout 

At the supervised learning step we use dropout to regularize our model. Without dropout, our F1 score drops from 82.2% to 81.3% on the training set using 10-fold cross validation.


5.4.4 Performance on a Word Similarity Task 

Table 6: Top five unigrams and trigrams with the highest scores for six classes.

As described in Section 3.1, we have the
noun-specific embeddings N as well as the
standard word embeddings W. We evaluated
the learned embeddings using a word-level
semantic evaluation task called WordSim-353
(Finkelstein et al., 2001). This dataset consists
of 353 pairs of nouns and each pair has
an averaged human rating which corresponds to
a semantic similarity score. Evaluation is performed
by measuring Spearman's rank correlation
between the human ratings and the cosine similarity
scores of the embeddings. Table 5 shows
the evaluation results. We used the best settings
reported in Tables 2 and 3 since our method is designed
for relation classification and it is not clear
how to tune the hyperparameters for the word similarity
task. As shown in the result table, the noun-specific
embeddings perform better than the standard
embeddings in our method, which indicates
the noun-specific embeddings capture more useful
information in measuring the semantic similarity
between nouns. The performance of the noun-specific
embeddings is roughly the same as that of
the word2vec embeddings.
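
The evaluation protocol for the word similarity task (cosine similarity per word pair, then Spearman's rank correlation against the human ratings) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names, the toy embeddings, and the toy ratings are invented for the example and are not taken from WordSim-353 or from the authors' code.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy embeddings and human ratings (made-up values, for illustration only).
emb = {
    "car": [1.0, 0.0],
    "automobile": [1.0, 0.1],
    "train": [1.0, 1.0],
    "noon": [0.0, 1.0],
}
pairs = [("car", "automobile", 9.0), ("car", "train", 6.0), ("car", "noon", 1.0)]

human = [rating for _, _, rating in pairs]
model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
rho = spearman(human, model)
print(round(rho, 3))  # perfect rank agreement on this toy data
```

Only the ordering of the similarity scores matters for this metric, so embeddings with different scales can be compared directly.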


5.5 Qualitative Analysis on the Embeddings 

Using the n-gram embeddings $h_i$ in Eq. ©, we inspect
which n-grams are relevant to each relation
class after the supervised learning step of RelEmb.
When the context size c is 3, we can use at most
7-grams. The learned weight matrix S in Section 3.3
is used to detect the most relevant n-grams
for each class. More specifically, for each n-gram
embedding (n = 1, 3) in the training set, we compute
the dot product between the n-gram embedding
and the corresponding components in S. We
then select the pairs of n-grams and class labels
with the highest scores. In Table 6 we show the top
five n-grams for six classes. These results clearly
show that the n-gram embeddings capture salient
syntactic patterns which are useful for the relation
classification task.
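
The ranking procedure described above (a dot product between each n-gram embedding and the class-specific weights, followed by selecting the top-scoring n-grams per class) can be sketched as below. This is a hedged illustration only: it assumes the slice of S that corresponds to the n-gram features has already been extracted into one weight vector per class, and the names `top_ngrams`, `ngram_embeddings`, and `class_weights`, as well as all values, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_ngrams(
    ngram_embeddings: Dict[str, List[float]],
    class_weights: Dict[str, List[float]],
    k: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each class, return the k n-grams with the highest dot-product score."""
    result = {}
    for label, weights in class_weights.items():
        scored = [(ngram, dot(vec, weights)) for ngram, vec in ngram_embeddings.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[label] = scored[:k]
    return result


# Toy n-gram embeddings and per-class weight slices (made-up values).
ngram_embeddings = {
    "was caused by": [1.0, 0.0],
    "is full of": [0.0, 1.0],
    "the subject of": [0.5, 0.5],
}
class_weights = {
    "Cause-Effect": [2.0, -1.0],
    "Content-Container": [-1.0, 2.0],
}

top = top_ngrams(ngram_embeddings, class_weights, k=2)
for label, ranked in top.items():
    print(label, ranked)
```

On this toy input, "was caused by" ranks first for the first class and "is full of" ranks first for the second, mirroring the kind of class-specific patterns the analysis looks for.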


6 Conclusions and Future Work 

We have presented a method for learning word embeddings
specifically designed for relation classification.
The word embeddings are trained using
large unlabeled corpora to capture lexical features
for relation classification. On a well-established
semantic relation classification task our method
significantly outperforms the baseline based on
word2vec. Our method also compares favorably to
previous state-of-the-art models that rely on syntactic
parsers and external semantic resources, despite
our method requiring only access to an unannotated
corpus and a POS tagger. For future work,
we will investigate how well our method performs
on other domains and datasets and how relation labels
can help when learning embeddings in a semi-supervised
learning setting.

Acknowledgments 

We thank the anonymous reviewers for their helpful
comments and suggestions.

References 

[Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and
Karen Livescu. 2014. Tailoring Continuous Word
Representations for Dependency Parsing. In Proceedings
of the 52nd Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 809-815.

[Baroni and Zamparelli2010] Marco Baroni and
Roberto Zamparelli. 2010. Nouns are Vectors,
Adjectives are Matrices: Representing Adjective-Noun
Constructions in Semantic Space. In Proceedings
of the 2010 Conference on Empirical Methods in
Natural Language Processing, pages 1183-1193.

[Boros et al.2014] Emanuela Boros, Romaric
Besançon, Olivier Ferret, and Brigitte Grau.
2014. Event Role Extraction using Domain-Relevant
Word Representations. In Proceedings
of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages
1852-1857.

[Bunescu and Mooney2005] Razvan Bunescu and
Raymond Mooney. 2005. A Shortest Path Dependency
Kernel for Relation Extraction. In Proceedings of
Human Language Technology Conference and Conference
on Empirical Methods in Natural Language
Processing, pages 724-731.

[Chen et al.2014] Wenliang Chen, Yue Zhang, and Min
Zhang. 2014. Feature Embedding for Dependency
Parsing. In Proceedings of COLING 2014,
the 25th International Conference on Computational
Linguistics: Technical Papers, pages 816-826.

[Ciaramita and Altun2006] Massimiliano Ciaramita 
and Yasemin Altun. 2006. Broad-Coverage Sense 
Disambiguation and Information Extraction with a 
Supersense Sequence Tagger. In Proceedings of the 
2006 Conference on Empirical Methods in Natural 
Language Processing, pages 594-602. 

[Collobert et al.2011] Ronan Collobert, Jason Weston,
Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. 2011. Natural Language Processing
(Almost) from Scratch. Journal of Machine
Learning Research, 12:2493-2537.

[dos Santos et al.2015] Cicero Nogueira dos Santos,
Bing Xiang, and Bowen Zhou. 2015. Classifying
Relations by Ranking with Convolutional Neural
Networks. In Proceedings of the Joint Conference
of the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing
of the Asian Federation of Natural Language
Processing. to appear.

[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram
Singer. 2011. Adaptive Subgradient Methods for
Online Learning and Stochastic Optimization. Journal
of Machine Learning Research, 12:2121-2159.

[Ebrahimi and Dou2015] Javid Ebrahimi and Dejing
Dou. 2015. Chain Based RNN for Relation Classification.
In Proceedings of the 2015 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 1244-1249.

[Finkelstein et al.2001] Lev Finkelstein, Evgeniy
Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan,
Gadi Wolfman, and Eytan Ruppin. 2001. Placing
Search in Context: The Concept Revisited. In Proceedings
of the Tenth International World Wide Web
Conference.

[Girju et al.2007] Roxana Girju, Preslav Nakov, Vivi
Nastase, Stan Szpakowicz, Peter Turney, and Deniz
Yuret. 2007. SemEval-2007 Task 04: Classification
of Semantic Relations between Nominals. In Proceedings
of the Fourth International Workshop on
Semantic Evaluations (SemEval-2007), pages 13-18.

[Grefenstette and Sadrzadeh2011] Edward Grefenstette
and Mehrnoosh Sadrzadeh. 2011. Experimental
Support for a Categorical Compositional Distributional
Model of Meaning. In Proceedings of the
2011 Conference on Empirical Methods in Natural
Language Processing, pages 1394-1404.

[Guo et al.2014] Jiang Guo, Wanxiang Che, Haifeng
Wang, and Ting Liu. 2014. Revisiting Embedding
Features for Simple Semi-supervised Learning.
In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 110-120.

[Hashimoto et al.2013] Kazuma Hashimoto, Makoto
Miwa, Yoshimasa Tsuruoka, and Takashi
Chikayama. 2013. Simple Customization of
Recursive Neural Networks for Semantic Relation
Classification. In Proceedings of the 2013 Conference
on Empirical Methods in Natural Language
Processing, pages 1372-1376.

[Hashimoto et al.2014] Kazuma Hashimoto, Pontus
Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka.
2014. Jointly Learning Word Representations and
Composition Functions Using Predicate-Argument
Structures. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 1544-1555.

[Hendrickx et al.2010] Iris Hendrickx, Su Nam Kim,
Zornitsa Kozareva, Preslav Nakov, Diarmuid
Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti,
Lorenza Romano, and Stan Szpakowicz. 2010.
SemEval-2010 Task 8: Multi-Way Classification of
Semantic Relations between Pairs of Nominals. In
Proceedings of the 5th International Workshop on
Semantic Evaluation, pages 33-38.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava,
Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2012. Improving neural networks
by preventing co-adaptation of feature detectors.
CoRR, abs/1207.0580.

[Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie.
2014. Deep Recursive Neural Networks for Compositionality
in Language. In Advances in Neural Information
Processing Systems 27, pages 2096-2104.

[Kartsaklis and Sadrzadeh2013] Dimitri Kartsaklis and
Mehrnoosh Sadrzadeh. 2013. Prior Disambiguation
of Word Tensors for Constructing Sentence Vectors.
In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, pages
1590-1601.

[Le and Mikolov2014] Quoc Le and Tomas Mikolov.
2014. Distributed Representations of Sentences and
Documents. In Proceedings of the 31st International
Conference on Machine Learning (ICML-14),
ICML’14, pages 1188-1196.

[Levy and Goldberg2014] Omer Levy and Yoav Goldberg.
2014. Neural Word Embedding as Implicit
Matrix Factorization. In Advances in Neural Information
Processing Systems 27, pages 2177-2185.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg
Corrado, and Jeffrey Dean. 2013a. Efficient Estimation
of Word Representations in Vector Space. In
Proceedings of Workshop at the International Conference
on Learning Representations.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever,
Kai Chen, Greg S Corrado, and Jeff Dean. 2013b.
Distributed Representations of Words and Phrases
and their Compositionality. In Advances in Neural
Information Processing Systems 26, pages 3111-3119.

[Miyao and Tsujii2008] Yusuke Miyao and Jun’ichi
Tsujii. 2008. Feature Forest Models for Probabilistic
HPSG Parsing. Computational Linguistics,
34(1):35-80, March.

[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray
Kavukcuoglu. 2013. Learning word embeddings
efficiently with noise-contrastive estimation. In Advances
in Neural Information Processing Systems
26, pages 2265-2273.

[Nguyen and Grishman2014] Thien Huu Nguyen and
Ralph Grishman. 2014. Employing Word Representations
and Regularization for Domain Adaptation
of Relation Extraction. In Proceedings of the
52nd Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages
68-74.

[Noreen1989] Eric W. Noreen. 1989. Computer-Intensive
Methods for Testing Hypotheses: An Introduction.
Wiley-Interscience.

[Paulus et al.2014] Romain Paulus, Richard Socher,
and Christopher D Manning. 2014. Global Belief
Recursive Neural Networks. In Advances in Neural
Information Processing Systems 27, pages 2888-2896.

[Rink and Harabagiu2010] Bryan Rink and Sanda 
Harabagiu. 2010. UTD: Classifying Semantic 
Relations by Combining Lexical and Semantic 
Resources. In Proceedings of the 5th International 
Workshop on Semantic Evaluation, pages 256-259. 

[Socher et al.2012] Richard Socher, Brody Huval, 
Christopher D. Manning, and Andrew Y. Ng. 2012. 
Semantic Compositionality through Recursive 
Matrix-Vector Spaces. In Proceedings of the 2012 
Joint Conference on Empirical Methods in Natural 
Language Processing and Computational Natural 
Language Learning, pages 1201-1211. 

[Tang et al.2014] Duyu Tang, Furu Wei, Nan Yang,
Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning
Sentiment-Specific Word Embedding for Twitter
Sentiment Classification. In Proceedings of the
52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages
1555-1565.

[Turian et al.2010] Joseph Turian, Lev-Arie Ratinov,
and Yoshua Bengio. 2010. Word Representations:
A Simple and General Method for Semi-Supervised
Learning. In Proceedings of the 48th Annual Meeting
of the Association for Computational Linguistics,
pages 384-394.

[Wang and Manning2012] Sida Wang and Christopher
Manning. 2012. Baselines and Bigrams: Simple,
Good Sentiment and Topic Classification. In Proceedings
of the 50th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 90-94.

[Yu et al.2014] Mo Yu, Matthew R. Gormley, and Mark
Dredze. 2014. Factor-based Compositional Embedding
Models. In Proceedings of Workshop on Learning
Semantics at the 2014 Conference on Neural Information
Processing Systems.

[Zeng et al.2014] Daojian Zeng, Kang Liu, Siwei Lai,
Guangyou Zhou, and Jun Zhao. 2014. Relation
Classification via Convolutional Deep Neural Network.
In Proceedings of COLING 2014, the 25th International
Conference on Computational Linguistics:
Technical Papers, pages 2335-2344.

[Zhang et al.2006] Min Zhang, Jie Zhang, Jian Su, and
GuoDong Zhou. 2006. A Composite Kernel to Extract
Relations between Entities with Both Flat and
Structured Features. In Proceedings of the 21st International
Conference on Computational Linguistics
and 44th Annual Meeting of the Association for
Computational Linguistics, pages 825-832.



